Quantitative parameters in corpus design: Estimating the optimum text size in Modern Greek language

Author

  • George K. Mikros
Abstract

The aim of this paper is to investigate the major quantitative parameters related to the definition of the optimum text size in Modern Greek corpus development. Using the Hellenic National Corpus (HNC) (Hatzigeorgiu et al., 2000) as a reference point, we estimated a number of critical statistical measures regarding feature counting in different text sizes. The results indicate that frequent linguistic features behave differently from medium-frequency and rare ones, and that increasing the text size does not affect them uniformly.
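As a rough illustration of the kind of measurement the abstract describes, the sketch below splits a token list into equal-sized chunks, counts one feature per chunk, and reports how much its relative frequency varies across chunks. This is not the paper's procedure: the function names, the choice of the coefficient of variation as the stability measure, and the chunking scheme are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def chunk_relative_frequencies(tokens, feature, chunk_size):
    """Relative frequency of `feature` in each consecutive full chunk.

    Hypothetical helper: trailing tokens that do not fill a whole chunk
    are ignored so that every chunk has the same size.
    """
    n_chunks = len(tokens) // chunk_size
    freqs = []
    for i in range(n_chunks):
        chunk = tokens[i * chunk_size:(i + 1) * chunk_size]
        freqs.append(chunk.count(feature) / chunk_size)
    return freqs


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean.

    Assumed stability measure: lower values mean the feature's relative
    frequency is more consistent across chunks of this size.
    """
    m = mean(values)
    return pstdev(values) / m if m > 0 else float("inf")
```

Calling `coefficient_of_variation(chunk_relative_frequencies(tokens, feature, size))` for several values of `size`, once for a frequent feature and once for a rare one, gives a simple way to see whether the two stabilize at different text sizes.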

Similar resources

Design and Implementation of the Online ILSP Greek Corpus

This paper presents the Hellenic National Corpus (HNC), which is the corpus of Modern Greek developed by the Institute for Language and Speech Processing (ILSP). The presentation describes all stages of the creation of the corpus: collection of the material, tagging and tokenizing, construction of the database and the online implementation which aims at rendering the corpus accessible over the Internet to...

Vergina: A Modern Greek Speech Database for Speech Synthesis

The present paper outlines the Vergina speech database, which was developed in support of research and development of corpus-based unit selection and statistical parametric speech synthesis systems for the Modern Greek language. In the following, we describe the design, development and implementation of the recording campaign, as well as the annotation of the database. Specifically, a text corpus o...

Corpus based coreference resolution for Farsi text

"Coreference resolution", or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreferent when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...

Discovering Collocations in Modern Greek Language

In this paper two statistical methods for extracting collocations from text corpora written in Modern Greek are described: the mean and variance method and a method based on the χ² test. The mean and variance method calculates distances (“offsets”) between words in a corpus and looks for specific patterns of distance. The χ² test is combined with the formulation of a null hypothesis H0 for a samp...

Constructing a segment database for Greek time domain speech synthesis

In this article, a methodology is presented regarding the design of a segment database for use with a time-domain speech synthesis system for the Greek language. The main issue of this process is the systematic generation of a corpus containing all possible instances of the segments for the specific language. Particular issues such as the phonetic coverage, the sentence selection as well as ite...

Publication date: 2002